Abstract

Variable screening is a fast dimension reduction technique for assisting high dimensional
feature selection. As a preselection method, it selects a moderate size subset of
candidate variables for further refining via feature selection to produce the final model.
The performance of variable screening depends on both computational efficiency and
the ability to dramatically reduce the number of variables without discarding the important
ones. When the data dimension $p$ is substantially larger than the sample size $n$,
variable screening becomes crucial as 1) faster feature selection algorithms are needed;
2) conditions guaranteeing selection consistency might fail to hold. This article studies
a class of linear screening methods and establishes consistency theory for this special
class. In particular, we prove the restricted diagonally dominant (RDD) condition is a
necessary and sufficient condition for strong screening consistency. As concrete examples,
we show two screening methods, SIS and HOLP, are both strong screening consistent
(subject to additional constraints) with large probability if $n > O((\rho s + \sigma/\tau)^2 \log p)$
under random designs. In addition, we relate the RDD condition to the irrepresentable
condition, and highlight limitations of SIS.


1 Introduction 

The rapidly growing data dimension has brought new challenges to statistical variable selection,
a crucial technique for identifying important variables to facilitate interpretation
and improve prediction accuracy. Recent decades have witnessed an explosion of research in
variable selection and related fields such as compressed sensing [1, 2], with a core focus on
regularized methods [3, 4, 5, 6, 7]. Regularized methods can consistently recover the support
of coefficients, i.e., the non-zero signals, via optimizing regularized loss functions under
certain conditions [8, 9, 10]. However, in the big data era when $p$ far exceeds $n$, such
regularized methods might fail for two reasons. First, the conditions that guarantee variable
selection consistency for convex regularized methods such as lasso might fail to hold when
$p \gg n$. Second, the computational expense of both convex and non-convex regularized
methods increases dramatically with large $p$.

Bearing these concerns in mind, [11] propose the concept of “variable screening”, a fast
technique that reduces data dimensionality from $p$ to a size comparable to $n$, with all
predictors having non-zero coefficients preserved. They propose a marginal correlation based fast
screening technique, “Sure Independence Screening” (SIS), that can preserve signals with
large probability. However, this method relies on a strong assumption that the marginal
correlations between the response and the important predictors are high [11], which is easily
violated in practice. [12] extends the marginal correlation to Spearman’s rank
correlation, which is shown to gain certain robustness but is still limited by the same strong
assumption. [13] and [14] take a different approach to attack the screening problem. They
both adopt variants of a forward selection type algorithm that includes one variable at a
time for constructing a candidate variable set for further refining. These methods eliminate
the strong marginal assumption in [11] and have been shown to achieve better empirical
performance. However, such improvement is limited by the extra computational burden caused
by their iterative framework, which is reported to be high when $p$ is large [15]. To ameliorate
concerns in both screening performance and computational efficiency, [15] develop a new type
of screening method termed “High-dimensional ordinary least-square projection” (HOLP).
This new screener relaxes the strong marginal assumption required by SIS and can be
computed efficiently (complexity is $O(n^2 p)$), and is thus scalable to ultra-high dimensionality.

This article focuses on linear models for tractability. As computation is one vital concern 
for designing a good screening method, we primarily focus on a class of linear screeners that 
can be efficiently computed, and study their theoretical properties. The main contributions 
of this article lie in three aspects. 

1. We define the notion of strong screening consistency to provide a unified framework 
for analyzing screening methods. In particular, we show a necessary and sufficient 
condition for a screening method to be strong screening consistent is that the screening 
matrix is restricted diagonally dominant (RDD). This condition gives insights into the 
design of screening matrices, while providing a framework to assess the effectiveness of 
screening methods. 

2. We relate RDD to other existing conditions. The irrepresentable condition (IC) [8]
is necessary and sufficient for sign consistency of lasso [3]. In contrast to IC that is
specific to the design matrix, RDD involves another ancillary matrix that can be chosen
arbitrarily. Such flexibility allows RDD to hold even when IC fails if the ancillary
matrix is carefully chosen (as in HOLP). When the ancillary matrix is chosen as the
design matrix, certain equivalence is shown between RDD and IC, revealing the difficulty
for SIS to achieve screening consistency. We also comment on the relationship
between RDD and the restricted eigenvalue condition (REC) [6] which is commonly
seen in the high dimensional literature. We illustrate via a simple example that RDD
might not be necessarily stronger than REC.

3. We study the behavior of SIS and HOLP under random designs, and prove that a
sample size of $n = O((\rho s + \sigma/\tau)^2 \log p)$ is sufficient for SIS and HOLP to be screening
consistent, where $s$ is the sparsity, $\rho$ measures the diversity of signals and $\tau/\sigma$ evaluates
the signal-to-noise ratio. This is to be compared to the sign consistency results in [9]
where the design matrix is fixed and assumed to follow the IC.

The article is organized as follows. In Section 2, we set up the basic problem and describe
the framework of variable screening. In Section 3, we provide a deterministic necessary
and sufficient condition for consistent screening. Its relationship with the irrepresentable
condition is discussed in Section 4. In Section 5, we prove the consistency of SIS and HOLP
under random designs by showing the RDD condition is satisfied with large probability,
although the requirement on SIS is much more restrictive.


2 Linear screening 

Consider the usual linear regression 


$$Y = X\beta + \epsilon,$$

where $Y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix and $\epsilon$ is the noise. The
regression task is to learn the coefficient vector $\beta$. In the high dimensional setting where
$p \gg n$, a sparsity assumption is often imposed on $\beta$ so that only a small portion of the
coordinates are non-zero. Such an assumption splits the task of learning $\beta$ into two phases.
The first is to recover the support of $\beta$, i.e., the location of non-zero coefficients; the second
is to estimate the value of these non-zero signals. This article mainly focuses on the first
phase.

As pointed out in the introduction, when the dimensionality is too high, using regularization
methods raises concerns both computationally and theoretically. To reduce
the dimensionality, [11] suggest a variable screening framework by finding a submodel

$$\mathcal{M}_d = \{i : |\hat\beta_i| \text{ is among the largest } d \text{ coordinates of } |\hat\beta|\} \quad \text{or} \quad \mathcal{M}_\gamma = \{i : |\hat\beta_i| > \gamma\}.$$

Let $Q = \{1, 2, \cdots, p\}$ and define $S$ as the true model with $s = |S|$ being its cardinality.
The hope is that the submodel size $|\mathcal{M}_d|$ or $|\mathcal{M}_\gamma|$ will be smaller or comparable to $n$, while
$S \subseteq \mathcal{M}_d$ or $S \subseteq \mathcal{M}_\gamma$. To achieve this goal two steps are usually involved in the screening
analysis. The first is to show there exists some $\gamma$ such that $\min_{i \in S} |\hat\beta_i| > \gamma$ and the second
step is to bound the size of $\mathcal{M}_\gamma$ such that $|\mathcal{M}_\gamma| = O(n)$. To unify these steps for a
more comprehensive theoretical framework, we put forward a slightly stronger definition of
screening consistency in this article.

Definition 2.1. (Strong screening consistency) An estimator $\hat\beta$ (of $\beta$) is strong screening
consistent if it satisfies that

$$\min_{i \in S} |\hat\beta_i| > \max_{i \notin S} |\hat\beta_i| \quad (1)$$

and

$$\mathrm{sign}(\hat\beta_i) = \mathrm{sign}(\beta_i), \quad \forall i \in S. \quad (2)$$

Remark 2.1. This definition does not differ much from the usual screening property studied
in the literature, which requires $\min_{i \in S} |\hat\beta_i| > \max^{(n-s)}_{i \notin S} |\hat\beta_i|$, where $\max^{(k)}$ denotes the $k$th
largest item.

The key of strong screening consistency is the property (1) that requires the estimator to
preserve consistent ordering of the zero and non-zero coefficients. It is weaker than variable
selection consistency in [8]. The requirement in (2) can be seen as a relaxation of the sign
consistency defined in [8], as no requirement for $\hat\beta_i, i \notin S$ is needed. As shown later, such
relaxation tremendously reduces the restriction on the design matrix, and allows screening
methods to work for a broader choice of $X$.
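Conditions (1) and (2) are easy to test numerically when the true $\beta$ is known, as in a simulation. The following is a minimal sketch, not part of the original analysis; the function name `is_strong_screening_consistent` is hypothetical and NumPy is assumed to be available.

```python
import numpy as np

def is_strong_screening_consistent(beta_hat, beta):
    """Check conditions (1) and (2) of Definition 2.1 for a given estimate.

    Hypothetical helper for simulations where the true beta is known.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta = np.asarray(beta, dtype=float)
    support = beta != 0                      # true model S
    if not support.any() or support.all():
        # Condition (1) is vacuous when S or its complement is empty.
        ordering_ok = True
    else:
        # (1): every true signal outranks every null coordinate in magnitude.
        ordering_ok = np.abs(beta_hat[support]).min() > np.abs(beta_hat[~support]).max()
    # (2): signs agree on the support only; nothing is required off the support.
    signs_ok = np.all(np.sign(beta_hat[support]) == np.sign(beta[support]))
    return bool(ordering_ok and signs_ok)
```

Note that the check places no constraint on the signs or exact values of $\hat\beta_i$ for $i \notin S$, which is exactly the relaxation relative to sign consistency discussed above.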

The focus of this article is to study the theoretical properties of a special class of screeners
that take the linear form

$$\hat\beta = AY$$

for some $p \times n$ ancillary matrix $A$. Examples include sure independence screening (SIS)
where $A = X^T/n$ and high-dimensional ordinary least-square projection (HOLP) where
$A = X^T(XX^T)^{-1}$. We choose to study the class of linear estimators because linear screening
is computationally efficient and theoretically tractable. We note that the usual ordinary
least-squares estimator is also a special case of linear estimators although it is not well
defined for $p > n$.
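As an illustration of the linear form $\hat\beta = AY$, the two screeners and the submodel $\mathcal{M}_d$ can be sketched in a few lines of NumPy. The function names below are hypothetical, and the HOLP sketch assumes $XX^T$ is invertible (which requires $p \ge n$ and full row rank).

```python
import numpy as np

def sis_screener(X, y):
    """SIS estimate: beta_hat = X^T y / n, i.e. A = X^T / n."""
    n = X.shape[0]
    return X.T @ y / n

def holp_screener(X, y):
    """HOLP estimate: beta_hat = X^T (X X^T)^{-1} y.

    Assumes X X^T is invertible; solving the n x n system avoids
    forming the inverse explicitly.
    """
    return X.T @ np.linalg.solve(X @ X.T, y)

def top_d_submodel(beta_hat, d):
    """Indices of the d largest |beta_hat_i|, i.e. the submodel M_d."""
    return np.argsort(-np.abs(beta_hat))[:d]
```

Both screeners need only matrix products and, for HOLP, one $n \times n$ linear solve, which is where the computational advantage over iterative or regularized methods comes from.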


3 Deterministic guarantees 


In this section, we derive the necessary and sufficient condition that guarantees $\hat\beta = AY$ to
be strong screening consistent. The design matrix $X$ and the error $\epsilon$ are treated as fixed
in this section and we will investigate random designs later. We consider the set of sparse
coefficient vectors defined by

$$\mathcal{B}(s, \rho) = \left\{ \beta \in \mathbb{R}^p : |\mathrm{supp}(\beta)| \le s, \ \frac{\max_{i \in \mathrm{supp}(\beta)} |\beta_i|}{\min_{i \in \mathrm{supp}(\beta)} |\beta_i|} \le \rho \right\}.$$

The set $\mathcal{B}(s, \rho)$ contains vectors having at most $s$ non-zero coordinates with the ratio of the
largest and smallest coordinate (in absolute value) bounded by $\rho$. Before proceeding to the main result of this
section, we introduce some terminology that helps to establish the theory.

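Definition 3.1. (restricted diagonally dominant matrix) A $p \times p$ symmetric matrix $\Phi$ is
restricted diagonally dominant with sparsity $s$ if for any $I \subseteq Q$, $|I| \le s - 1$ and $i \in Q \setminus I$

$$\Phi_{ii} > C_0 \max\left\{ \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}|, \ \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \right\} + |\Phi_{ik}|, \quad \forall k \neq i, \ k \in Q \setminus I,$$

where $C_0 \ge 1$ is a constant.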
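For small matrices, Definition 3.1 can be verified by direct enumeration. The sketch below is only an illustration under that assumption: the name `is_rdd` is hypothetical, and the loop over all index sets $I$ is exponential in $s$, so it is not a practical test for large $p$.

```python
from itertools import combinations
import numpy as np

def is_rdd(Phi, s, C0=1.0):
    """Brute-force check of Definition 3.1 for a symmetric p x p matrix.

    Enumerates every index set I with |I| <= s - 1 and every pair
    i != k outside I. Intended for toy examples only.
    """
    Phi = np.asarray(Phi, dtype=float)
    p = Phi.shape[0]
    for size in range(0, s):                     # |I| = 0, ..., s - 1
        for I in combinations(range(p), size):
            idx = np.array(I, dtype=int)
            rest = [i for i in range(p) if i not in I]
            for i in rest:
                for k in rest:
                    if k == i:
                        continue
                    plus = np.abs(Phi[i, idx] + Phi[k, idx]).sum()
                    minus = np.abs(Phi[i, idx] - Phi[k, idx]).sum()
                    if not Phi[i, i] > C0 * max(plus, minus) + abs(Phi[i, k]):
                        return False
    return True
```

For example, the identity matrix passes for any sparsity, while a matrix with unit diagonal and all off-diagonal entries equal to 0.9 already fails with $s = 2$ and $C_0 = 1$.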


Notice this definition implies that for $i \in Q \setminus I$

$$\Phi_{ii} > C_0 \left( \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}| + \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \right) / 2 \ \ge \ C_0 \sum_{j \in I} |\Phi_{ij}|, \quad (3)$$

which is related to the usual diagonally dominant matrix. The restricted diagonally dominant
matrix provides a necessary and sufficient condition for any linear estimator $\hat\beta = AY$ to be
strong screening consistent. More precisely, we have the following result.

Theorem 1. For the noiseless case where $\epsilon = 0$, a linear estimator $\hat\beta = AY$ is strong
screening consistent for every $\beta \in \mathcal{B}(s, \rho)$, if and only if the screening matrix $\Phi = AX$ is
restricted diagonally dominant with sparsity $s$ and $C_0 \ge \rho$.


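Proof. Assume $\Phi$ is restricted diagonally dominant with sparsity $s$ and $C_0 \ge \rho$. Recall
$\hat\beta = \Phi\beta$. Suppose $S$ is the index set of non-zero predictors. For any $i \in S$, $k \notin S$, if we let
$I = S \setminus \{i\}$, then we have

$$|\hat\beta_i| = |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} \Big) = |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} (\Phi_{ij} + \Phi_{kj}) + \Phi_{ki} - \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} - \Phi_{ki} \Big)$$

$$> -|\beta_i| \Big( \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} \Big) = -\frac{|\beta_i|}{\beta_i} \Big( \sum_{j \in I} \beta_j \Phi_{kj} + \beta_i \Phi_{ki} \Big) = -\mathrm{sign}(\beta_i) \cdot \hat\beta_k,$$

where the absolute value in the first equality can be dropped because (3) together with
$|\beta_j/\beta_i| \le \rho$ makes the bracketed term positive, and

$$|\hat\beta_i| = |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} \Big) = |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} (\Phi_{ij} - \Phi_{kj}) - \Phi_{ki} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} \Big)$$

$$> |\beta_i| \Big( \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} \Big) = \mathrm{sign}(\beta_i) \cdot \hat\beta_k.$$

Therefore, whatever value $\mathrm{sign}(\beta_i)$ is, it always holds that $|\hat\beta_i| > |\hat\beta_k|$ and thus $\min_{i \in S} |\hat\beta_i| >
\max_{k \notin S} |\hat\beta_k|$.

To prove the sign consistency for non-zero coefficients, we notice that for $i \in S$,

$$\hat\beta_i \beta_i = \Phi_{ii} \beta_i^2 + \sum_{j \in I} \Phi_{ij} \beta_j \beta_i = \beta_i^2 \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} \Big) > 0.$$

The proof of necessity is left to the supplementary materials.

□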


The noiseless case is a good starting point to analyze $\hat\beta$. Intuitively, in order to preserve
the correct order of the coefficients in $\hat\beta = AX\beta$, one needs $AX$ to be close to a diagonally
dominant matrix, so that $\hat\beta_i, i \in S$ will take advantage of the large diagonal terms in $AX$
to dominate $\hat\beta_i, i \notin S$, which are just linear combinations of off-diagonal terms.

When noise is considered, the condition in Theorem 1 needs to be changed slightly to
accommodate extra discrepancies. In addition, the smallest non-zero coefficient has to be
lower bounded to ensure a certain level of signal-to-noise ratio. Thus, we augment our
previous definition of $\mathcal{B}(s, \rho)$ to have a signal strength control

$$\mathcal{B}_\tau(s, \rho) = \Big\{ \beta \in \mathcal{B}(s, \rho) \ \Big| \ \min_{i \in \mathrm{supp}(\beta)} |\beta_i| \ge \tau \Big\}.$$

Then we can obtain the following modified theorem.

Theorem 2. With noise, the linear estimator $\hat\beta = AY$ is strong screening consistent for
every $\beta \in \mathcal{B}_\tau(s, \rho)$ if $\Phi = AX - 2\tau^{-1} \|A\epsilon\|_\infty I_p$ is restricted diagonally dominant with sparsity
$s$ and $C_0 \ge \rho$.
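The matrix appearing in Theorem 2 is straightforward to form once $A$, $X$, $\epsilon$ and $\tau$ are given. The sketch below is an illustration for simulations, where the noise vector is known; the function name is hypothetical.

```python
import numpy as np

def noise_adjusted_screening_matrix(A, X, eps, tau):
    """Phi = A X - 2 * tau^{-1} * ||A eps||_inf * I_p, as in Theorem 2.

    A is p x n, X is n x p, eps is the length-n noise vector and tau > 0
    is the lower bound on the non-zero |beta_i|.
    """
    p = X.shape[1]
    noise_level = np.max(np.abs(A @ eps))      # sup-norm of A eps
    return A @ X - (2.0 * noise_level / tau) * np.eye(p)
```

With $\epsilon = 0$ the adjustment vanishes and the matrix reduces to $AX$ from Theorem 1; larger noise or a smaller $\tau$ shrinks the diagonal and makes the RDD condition harder to satisfy.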
The proof of Theorem 2 is essentially the same as that of Theorem 1 and is thus left to the
supplementary materials. The condition in Theorem 2 can be further tailored to a necessary
and sufficient version with extra manipulation on the noise term. Nevertheless, this might
not be useful in practice due to the randomness in noise. In addition, the current version of
Theorem 2 is already tight in the sense that there exists some noise vector $\epsilon$ such that the
condition in Theorem 2 is also necessary for strong screening consistency.

Theorems 1 and 2 establish ground rules for verifying consistency of a given screener and
provide practical guidance for screening design. In Section 5, we consider some concrete
examples of ancillary matrix $A$ and prove that conditions in Theorems 1 and 2 are satisfied
by the corresponding screeners with large probability under random designs.


4 Relationship with other conditions 

For some special cases such as sure independence screening (SIS), the restricted diagonally
dominant (RDD) condition is related to the strong irrepresentable condition (IC) proposed
in [8]. Assume each column of $X$ is standardized to have mean zero. Letting $C = X^T X/n$
and $\beta$ be a given coefficient vector, the IC is expressed as

$$\| C_{S^c,S} C_{S,S}^{-1} \cdot \mathrm{sign}(\beta_S) \|_\infty \le 1 - \theta \quad (4)$$

for some $\theta > 0$, where $C_{A,B}$ represents the sub-matrix of $C$ with row indices in $A$ and column
indices in $B$. The authors of [8] enumerate several scenarios of $C$ such that IC is satisfied. We
verify some of these scenarios for the screening matrix $\Phi$.
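The left-hand side of (4) can be evaluated directly for a given matrix, support and sign pattern. The following sketch is illustrative only; the function name is hypothetical, and it assumes $S$ is a proper, non-empty subset of the indices with $C_{S,S}$ invertible.

```python
import numpy as np

def irrepresentable_value(C, support, signs):
    """Return || C_{S^c,S} C_{S,S}^{-1} sign(beta_S) ||_inf for a p x p matrix C.

    `support` lists the indices in S and `signs` holds sign(beta_S) in the
    same order. The IC in (4) holds when the returned value is <= 1 - theta.
    """
    C = np.asarray(C, dtype=float)
    S = np.asarray(support, dtype=int)
    Sc = np.setdiff1d(np.arange(C.shape[0]), S)          # complement of S
    v = np.linalg.solve(C[np.ix_(S, S)], np.asarray(signs, dtype=float))
    return float(np.max(np.abs(C[np.ix_(Sc, S)] @ v)))
```

For an equicorrelated matrix with off-diagonal entries 0.5 and $S = \{0, 1\}$, the value is $2/3$ when both signs agree and $0$ when they differ, so the IC depends on the sign pattern as well as on $C$.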

Corollary 1. If $\Phi_{ii} = 1, \forall i$ and $|\Phi_{ij}| < c/(2s), \forall i \neq j$ for some $0 < c < 1$ as defined in
Corollary 1 and 2 in [8], then $\Phi$ is a restricted diagonally dominant matrix with sparsity $s$
and $C_0 \ge 1/c$.

If $|\Phi_{ij}| < r^{|i-j|}, \forall i, j$ for some $0 < r < 1$ as defined in Corollary 3 in [8], then $\Phi$ is a
restricted diagonally dominant matrix with sparsity $s$ and $C_0 \ge (1 - r)^2/(4r)$.

A more explicit but nontrivial relationship between IC and RDD is illustrated below
when $|S| = 2$.

Theorem 3. Assume $\Phi_{ii} = 1, \forall i$ and $|\Phi_{ij}| < r, \forall i \neq j$. If $\Phi$ is restricted diagonally
dominant with sparsity 2 and $C_0 \ge \rho$, then $\Phi$ satisfies

$$\| \Phi_{S^c,S} \Phi_{S,S}^{-1} \cdot \mathrm{sign}(\beta_S) \|_\infty \le \frac{\rho^{-1}}{1 - r}$$

for all $\beta \in \mathcal{B}(2, \rho)$. On the other hand, if $\Phi$ satisfies the IC for all $\beta \in \mathcal{B}(2, \rho)$ for some $\theta$,
then $\Phi$ is a restricted diagonally dominant matrix with sparsity 2 and

$$C_0 \ge \frac{1}{1 - \theta} \, \frac{1 - r}{1 + r}.$$


Theorem 3 demonstrates certain equivalence between IC and RDD. However, it does not
mean that RDD is also a strong requirement. Notice that IC is directly imposed on the
covariance matrix $X^T X/n$. This makes IC a strong assumption that is easily violated; for
example, when the predictors are highly correlated. In contrast to IC, RDD is imposed on
the matrix $AX$ where there is flexibility in choosing $A$. Only when $A$ is chosen to be $X^T/n$ is RDD
as strong as IC, as shown in the next theorem. For other choices of $A$, such as
HOLP defined in the next section, the estimator satisfies RDD even when predictors are highly
correlated. Therefore, RDD is considered a weak requirement.

For SIS, the screening matrix $\Phi = X^T X/n$ coincides with the covariance matrix,
making RDD and IC effectively equivalent. The following theorem formalizes this.

Theorem 4. Let $A = X^T/n$ and standardize columns of $X$ to have sample variance one.
Assume X satisfies the sparse Riesz condition m, i.e, 


$$\min_{\pi \subseteq Q, \ |\pi| \le s} \lambda_{\min}\big( X_\pi^T X_\pi / n \big) \ge \mu,$$

for some $\mu > 0$. Now if $AX$ is restricted diagonally dominant with sparsity $s + 1$ and $C_0 \ge \rho$
with $\rho > \sqrt{s}/\mu$, then $X$ satisfies the IC for any $\beta \in \mathcal{B}(s, \rho)$.

In other words, under the condition $\rho > \sqrt{s}/\mu$, the strong screening consistency of SIS
for $\mathcal{B}(s + 1, \rho)$ implies the model selection consistency of lasso for $\mathcal{B}(s, \rho)$.

Theorem 4 illustrates the difficulty of SIS. The necessary condition that guarantees
good screening performance of SIS also guarantees the model selection consistency of lasso. 
However, such a strong necessary condition does not mean that SIS should be avoided in 
practice given its substantial advantages in terms of simplicity and computational efficiency. 
The strong screening consistency defined in this article is stronger than conditions commonly 
used in justifying screening procedures as in m 

Another common assumption in the high dimensional literature is the restricted eigenvalue
condition (REC). Compared to REC, RDD is not necessarily stronger due to its flexibility
in choosing the ancillary matrix $A$. [17, 18] prove that the REC is satisfied when the
design matrix is sub-Gaussian. However, REC might not be guaranteed when the rows of
$X$ follow heavy-tailed distributions. In contrast, as the example shown in the next section and
in [15], by choosing $A = X^T(XX^T)^{-1}$, the resulting estimator satisfies RDD even when the
rows of $X$ follow heavy-tailed distributions.


5 Screening under random designs 

In this section, we consider linear screening under random designs when $X$ and $\epsilon$ are Gaussian.
The theory developed in this section can be easily extended to a broader family of
distributions, for example, where $\epsilon$ follows a sub-Gaussian distribution [19] and $X$ follows
an elliptical distribution hd m- We focus on the Gaussian case for conciseness. Let 
$\epsilon \sim N(0, \sigma^2)$ and $X \sim N(0, \Sigma)$. We prove the screening consistency of SIS and HOLP by
verifying the condition in Theorem 2. Recall the ancillary matrices for SIS and HOLP are
defined respectively as

$$A_{SIS} = X^T/n, \quad A_{HOLP} = X^T(XX^T)^{-1}.$$

For simplicity, we assume $\Sigma_{ii} = 1$ for $i = 1, 2, \cdots, p$. To verify the RDD condition, it is
essential to quantify the magnitude of the entries of $AX$ and $A\epsilon$.

Lemma 1. Let $\Phi = A_{SIS}X$, then for any $t > 0$ and $i \neq j \in Q$, we have



and 



where $K = \|\chi^2(1) - 1\|_{\psi_1}$ is a constant, $\chi^2(1)$ is a chi-square random variable with one
degree of freedom and the norm $\| \cdot \|_{\psi_1}$ is defined in [19].

Lemma 1 states that the screening matrix $\Phi = A_{SIS}X$ for SIS will eventually converge to
the covariance matrix $\Sigma$ in $\ell_\infty$ when $n$ tends to infinity and $\log p = o(n)$. Thus, the screening
performance of SIS strongly relies on the structure of $\Sigma$. In particular, the (asymptotically)
necessary and sufficient condition for SIS being strong screening consistent is $\Sigma$ satisfying
the RDD condition. For the noise term, we have the following lemma.
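The entrywise convergence of $A_{SIS}X = X^T X/n$ to $\Sigma$ can be observed in a small simulation. The sketch below is an illustration under assumptions that are not from the original text: the function name, the AR(1)-type covariance and the sample sizes are all hypothetical choices.

```python
import numpy as np

def sis_max_deviation(n, Sigma, seed=0):
    """Largest entrywise gap between Phi = X^T X / n and Sigma.

    Rows of X are drawn i.i.d. from N(0, Sigma); Sigma is assumed to be a
    valid covariance matrix with unit diagonal.
    """
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Phi = X.T @ X / n
    return float(np.max(np.abs(Phi - Sigma)))

if __name__ == "__main__":
    p = 20
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # hypothetical AR(1) covariance
    for n in (50, 500, 5000):
        print(n, sis_max_deviation(n, Sigma))
```

Because the limit is $\Sigma$ itself, no amount of data helps SIS when $\Sigma$ is far from restricted diagonally dominant; the simulation only shows how quickly the sample version approaches that limit.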

Lemma 2. Let $\eta = A_{SIS}\epsilon$. For any $t > 0$ and $i \in Q$, we have

where $K$ is defined the same as in Lemma 1.


The proof of Lemma 2 is essentially the same as the proof of off-diagonal terms in Lemma
1 and is thus omitted. As indicated before, the necessary and sufficient condition for SIS
to be strong screening consistent is that $\Sigma$ follows RDD. As RDD is usually hard to verify,
we consider a stronger sufficient condition inspired by Corollary 1.

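Theorem 5. Let $r = \max_{i \neq j} |\Sigma_{ij}|$. If $r < \frac{1}{2\rho s}$, then for any $\delta > 0$, if the sample size satisfies

$$n > 144K \left( \frac{1 + 2\rho s + 2\sigma/\tau}{1 - 2\rho s r} \right)^2 \log(3p/\delta), \quad (5)$$

where $K$ is defined in Lemma 1, then with probability at least $1 - \delta$, $\Phi = A_{SIS}X - 2\tau^{-1} \|A_{SIS}\epsilon\|_\infty I_p$
is restricted diagonally dominant with sparsity $s$ and $C_0 \ge \rho$. In other words, SIS is screening
consistent for any $\beta \in \mathcal{B}_\tau(s, \rho)$.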

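Proof. Taking a union bound on the results from Lemmas 1 and 2, we have for any $t > 0$ and
$p > 2$,

$$P\Big( \min_{i \in Q} \Phi_{ii} < 1 - t \ \text{ or } \ \max_{i \neq j} |\Phi_{ij}| > r + t \ \text{ or } \ \|\eta\|_\infty > \sigma t \Big) \le 7p^2 \exp\Big\{ -\frac{n}{K} \min\Big( \frac{t^2}{72e^2}, \frac{t}{6e} \Big) \Big\}.$$

In other words, for any $\delta > 0$, when $n \ge K \log(7p^2/\delta)$, with probability at least $1 - \delta$, we
have

$$\min_{i \in Q} \Phi_{ii} \ge 1 - 6\sqrt{2}e \sqrt{\frac{K \log(7p^2/\delta)}{n}}, \quad \max_{i \neq j} |\Phi_{ij}| \le r + 6\sqrt{2}e \sqrt{\frac{K \log(7p^2/\delta)}{n}},$$

$$\max_{i \in Q} |\eta_i| \le 6\sqrt{2}e \sigma \sqrt{\frac{K \log(7p^2/\delta)}{n}}.$$

A sufficient condition for $\Phi$ to be restricted diagonally dominant is that

$$\min_i \Phi_{ii} > 2\rho s \max_{i \neq j} |\Phi_{ij}| + 2\tau^{-1} \max_i |\eta_i|.$$

Plugging in the values we have

$$1 - 6\sqrt{2}e \sqrt{\frac{K \log(7p^2/\delta)}{n}} > 2\rho s \Big( r + 6\sqrt{2}e \sqrt{\frac{K \log(7p^2/\delta)}{n}} \Big) + 12\sqrt{2}e \sigma \tau^{-1} \sqrt{\frac{K \log(7p^2/\delta)}{n}}.$$

Solving the above inequality (notice that $7p^2/\delta < 9p^2/\delta^2$ and $\rho \ge 1$) completes the proof. □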
The requirement that $\max_{i \neq j} |\Sigma_{ij}| < 1/(2\rho s)$ or the necessary and sufficient condition
that $\Sigma$ is RDD strictly constrains the correlation structure of $X$, causing the difficulty for
SIS to be strong screening consistent. For HOLP we instead have the following result.

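Lemma 3. Let $\Phi = A_{HOLP}X$ and let $\kappa$ denote the condition number of $\Sigma$. Assume $p > c_0 n$ for some $c_0 > 1$, then for any $C > 0$ there
exist some $0 < c_1 < 1 < c_2$ and $c_3 > 0$ such that for any $t > 0$ and any $i \in Q$, $j \neq i$, we
have

$$P\Big( |\Phi_{ii}| < c_1 \kappa^{-1} \frac{n}{p} \Big) \le 2e^{-Cn}$$

and

$$P\Big( |\Phi_{ij}| > c_4 \kappa t \frac{\sqrt{n}}{p} \Big) \le 5e^{-Cn} + 2e^{-t^2/2},$$

where $c_4 = \sqrt{c_2(c_0 - c_1)} \big/ \sqrt{c_3(c_0 - 1)}$.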

Proof. The proof of Lemma 3 relies heavily on previous results for the Stiefel manifold
provided in the supplementary materials. We only sketch the basic idea here and leave
the complete proof to the supplementary materials. Defining $H = X^T(XX^T)^{-1/2}$, then
we have $\Phi = HH^T$ and $H$ follows the Matrix Angular Central Gaussian (MACG) distribution with
covariance $\Sigma$. The diagonal terms of $HH^T$ can be bounded similarly via the Johnson-Lindenstrauss
lemma, by using the fact that $HH^T = \Sigma^{1/2} U (U^T \Sigma U)^{-1} U^T \Sigma^{1/2}$, where $U$ is a
$p \times n$ random projection matrix. Now for off-diagonal terms, we decompose the Stiefel
manifold as $H = (G(H_2)H_1, \ H_2)$, where $H_1$ is a $(p - n + 1) \times 1$ vector, $H_2$ is a $p \times (n - 1)$
matrix and $G(H_2)$ is chosen so that $(G(H_2) \ H_2) \in O(p)$, and show that $H_1$ follows the Angular
Central Gaussian (ACG) distribution with covariance $G(H_2)^T \Sigma G(H_2)$ conditional on $H_2$. It
can be shown that $e_2^T HH^T e_1 \overset{d}{=} e_2^T G(H_2)H_1 \mid e_1^T H_2 = 0$. Let $t_1^2 = e_1^T HH^T e_1$, then $e_1^T H_2 =
0$ is equivalent to $e_1^T G(H_2)H_1 = t_1$, and we obtain the desired coupling distribution as
$e_2^T HH^T e_1 \overset{d}{=} e_2^T G(H_2)H_1 \mid e_1^T G(H_2)H_1 = t_1$. Using the normal representation of $ACG(\Sigma)$,
i.e., if $x = (x_1, \cdots, x_p) \sim N(0, \Sigma)$, then $x/\|x\| \sim ACG(\Sigma)$, we can write $G(H_2)H_1$ in terms
of normal variables and then bound all terms using concentration inequalities. □

Lemma 3 quantifies the entries of the screening matrix for HOLP. As illustrated in the
lemma, regardless of the covariance $\Sigma$, diagonal terms of $\Phi$ are always $O(n/p)$ and the
off-diagonal terms are $O(\sqrt{n}/p)$. Thus, with $n \ge O(s^2)$, $\Phi$ is likely to satisfy the RDD condition
with large probability. For the noise vector we have the following result.
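The scale of the HOLP screening matrix can be checked numerically. One fact needs no simulation: when $XX^T$ is invertible, $\Phi = X^T(XX^T)^{-1}X$ is an orthogonal projection of rank $n$, so its trace is exactly $n$ and its average diagonal entry is exactly $n/p$. The sketch below, with hypothetical function names and a Gaussian design chosen purely for illustration, reports that average together with the largest off-diagonal magnitude.

```python
import numpy as np

def holp_screening_matrix(X):
    """Phi = X^T (X X^T)^{-1} X; assumes X X^T is invertible (needs p >= n)."""
    return X.T @ np.linalg.solve(X @ X.T, X)

def holp_entry_summary(n, p, Sigma=None, seed=0):
    """Return (mean diagonal entry, max |off-diagonal entry|) of the HOLP Phi.

    Rows of X are drawn from N(0, Sigma); Sigma defaults to the identity.
    """
    rng = np.random.default_rng(seed)
    Sigma = np.eye(p) if Sigma is None else Sigma
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Phi = holp_screening_matrix(X)
    off = Phi - np.diag(np.diag(Phi))          # zero out the diagonal
    return float(np.mean(np.diag(Phi))), float(np.max(np.abs(off)))
```

Comparing the two returned numbers for growing $n$ at a fixed ratio $p/n$ gives a rough empirical view of the $n/p$ versus $\sqrt{n}/p$ separation described above.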

Lemma 4. Let $\eta = A_{HOLP}\epsilon$. Assume $p > c_0 n$ for some $c_0 > 1$, then for any $C > 0$ there
exist the same $c_1, c_2, c_3$ as in Lemma 3 such that for any $t > 0$ and $i \in Q$,

$$P\Big( |\eta_i| \ge \frac{\sigma \sqrt{c_2} \kappa t}{1 - c_0^{-1}} \frac{\sqrt{n}}{p} \Big) < 4e^{-Cn} + 2e^{-t^2/2},$$

if $n \ge 8C/(c_0 - 1)^2$.


The proof is almost identical to Lemma 2 and is provided in the supplementary materials.
The following theorem results after combining Lemmas 3 and 4.

Theorem 6. Assume $p > c_0 n$ for some $c_0 > 1$. For any $\delta > 0$, if the sample size satisfies

$$n > \max\Big\{ 2C' \kappa^4 (\rho s + \sigma/\tau)^2 \log(3p/\delta), \ \frac{8C}{(c_0 - 1)^2} \Big\}, \quad (6)$$

where $C' = \max\Big\{ \frac{4c_4^2}{c_1^2}, \ \frac{4c_2}{c_1^2 (1 - c_0^{-1})^2} \Big\}$ and $c_1, c_2, c_3, c_4, C$ are the same constants defined in Lemma
3, then with probability at least $1 - \delta$, $\Phi = A_{HOLP}X - 2\tau^{-1} \|A_{HOLP}\epsilon\|_\infty I_p$ is restricted
diagonally dominant with sparsity $s$ and $C_0 \ge \rho$. This implies HOLP is screening consistent
for any $\beta \in \mathcal{B}_\tau(s, \rho)$.

Proof. Notice that if

$$\min_i |\Phi_{ii}| > 2s\rho \max_{i \neq j} |\Phi_{ij}| + 2\tau^{-1} \|X^T(XX^T)^{-1}\epsilon\|_\infty, \quad (7)$$

then the proof is complete because $\Phi - 2\tau^{-1} \|X^T(XX^T)^{-1}\epsilon\|_\infty I_p$ is already a restricted diagonally
dominant matrix. Let $t = \sqrt{Cn}/\nu$. The above equation then requires

$$c_1 \kappa^{-1} \frac{n}{p} - \frac{2c_4 \sqrt{C} \kappa s \rho}{\nu} \frac{n}{p} - \frac{2\sigma \sqrt{c_2 C} \kappa}{(1 - c_0^{-1}) \tau \nu} \frac{n}{p} = \Big( c_1 \kappa^{-1} - \frac{2c_4 \sqrt{C} \kappa s \rho}{\nu} - \frac{2\sigma \sqrt{c_2 C} \kappa}{(1 - c_0^{-1}) \tau \nu} \Big) \frac{n}{p} > 0,$$

which implies that

$$\nu > \frac{2c_4 \sqrt{C} \kappa^2 \rho s}{c_1} + \frac{2\sigma \sqrt{c_2 C} \kappa^2}{c_1 (1 - c_0^{-1}) \tau} = C_1 \kappa^2 \rho s + C_2 \kappa^2 \tau^{-1} \sigma > 1,$$

where $C_1 = \frac{2c_4 \sqrt{C}}{c_1}$, $C_2 = \frac{2\sqrt{c_2 C}}{c_1 (1 - c_0^{-1})}$. Therefore, taking union bounds on all matrix entries, we
have

$$P\big( \text{(7) does not hold} \big) < (p + 5p^2) e^{-Cn} + 2p^2 e^{-Cn/\nu^2} < \Big( 7 + \frac{1}{n} \Big) p^2 e^{-Cn/\nu^2},$$

where the second inequality is due to the fact that $p > n$ and $\nu > 1$. Now for any $\delta > 0$, (7)
holds with probability at least $1 - \delta$ if

$$n \ge \frac{\nu^2}{C} \big( \log(7 + 1/n) + 2\log p - \log\delta \big),$$

which is satisfied provided (noticing $\sqrt{8} < 3$) $n \ge \frac{2\nu^2}{C} \log\frac{3p}{\delta}$. Now pushing $\nu$ to the limit
gives (6), the precise condition we need. □


There are several interesting observations on equations (5) and (6). First, $(\rho s + \sigma/\tau)^2$
appears in both expressions, suggesting this term might be a common requirement across
all linear screening methods. We note that $\rho s$ evaluates the sparsity and the diversity of
the signal $\beta$ while $\sigma/\tau$ is closely related to the signal-to-noise ratio. Furthermore, HOLP
relaxes the correlation constraint $r < 1/(2\rho s)$ or the covariance constraint ($\Sigma$ is RDD) with
the condition number constraint. Thus for any $\Sigma$, as long as the sample size is large enough,
strong screening consistency is assured. Finally, HOLP provides an example to satisfy the
RDD condition in answer to the question raised in Section 4.
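To get a feel for the common factor $(\rho s + \sigma/\tau)^2 \log p$, the rate can be tabulated for a few parameter settings. The helper below is a hypothetical illustration only: it drops every constant in (5) and (6), so its output indicates how the requirement scales, not an actual sample size.

```python
import math

def screening_rate(rho, s, sigma, tau, p):
    """Return (rho * s + sigma / tau)^2 * log(p), with all constants omitted.

    rho: signal diversity, s: sparsity, sigma: noise level,
    tau: smallest non-zero |beta_i|, p: number of predictors.
    """
    return (rho * s + sigma / tau) ** 2 * math.log(p)
```

The rate grows only logarithmically in $p$ but quadratically in $\rho s$, so the number and spread of the signals matter far more than the ambient dimension.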


6 Concluding remarks 

This article studies the theoretical properties of a class of high dimensional variable screening 
methods. In particular, we establish a necessary and sufficient condition in the form of 
restricted diagonally dominant screening matrices for strong screening consistency of a linear 
screener. We verify the condition for both SIS and HOLP under random designs. In 
addition, we show a close relationship between RDD and the IC, highlighting the difficulty 
of using SIS in screening for arbitrarily correlated predictors. 

For future work, it is of interest to see how linear screening can be adapted to compressed 
sensing 125 and how techniques such as preconditioning PD can improve the performance 
of marginal screening and variable selection. 
Appendix A: Proofs for Section 3

In this section, we prove the two theorems in Section 3. 


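Proof of Theorem 1. If $\Phi$ is restricted diagonally dominant with sparsity $s$ and $C_0 \ge \rho$,
we have for any $I \subseteq Q$ and $|I| \le s - 1$,

$$\Phi_{ii} > \rho \max\Big\{ \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}|, \ \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \Big\} + |\Phi_{ik}|, \quad \forall k \neq i \in Q \setminus I.$$

Recall $\hat\beta = \Phi\beta$. Suppose $S$ is the index set of non-zero predictors. For any $i \in S$, $k \notin S$, if
we fix $I = S \setminus \{i\}$, we have

$$|\hat\beta_i| = \Big| \Phi_{ii}\beta_i + \sum_{j \in I} \Phi_{ij}\beta_j \Big| \ge |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} \Big)$$

$$= |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} (\Phi_{ij} + \Phi_{kj}) + \Phi_{ki} - \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} - \Phi_{ki} \Big)$$

$$> -|\beta_i| \Big( \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} \Big) = -\frac{|\beta_i|}{\beta_i} \Big( \sum_{j \in I} \beta_j \Phi_{kj} + \beta_i \Phi_{ki} \Big) = -\mathrm{sign}(\beta_i) \cdot \hat\beta_k.$$

Similarly we have

$$|\hat\beta_i| = \Big| \Phi_{ii}\beta_i + \sum_{j \in I} \Phi_{ij}\beta_j \Big| \ge |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} \Big)$$

$$= |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} (\Phi_{ij} - \Phi_{kj}) - \Phi_{ki} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} \Big)$$

$$> |\beta_i| \Big( \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} \Big) = \mathrm{sign}(\beta_i) \cdot \hat\beta_k.$$

Therefore, whatever value $\mathrm{sign}(\beta_i)$ is, it always holds that $|\hat\beta_i| > |\hat\beta_k|$. Since this result is
true for any $i \in S$, $k \notin S$, we have

$$\min_{i \in S} |\hat\beta_i| > \max_{k \notin S} |\hat\beta_k|.$$

To prove the sign consistency for non-zero coefficients, notice that for $i \in S$,

$$\Phi_{ii} > \rho \Big( \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}| + \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \Big) / 2 \ \ge \ \rho \sum_{j \in I} |\Phi_{ij}|.$$

Thus,

$$\hat\beta_i \beta_i = \Phi_{ii}\beta_i^2 + \sum_{j \in I} \Phi_{ij}\beta_j\beta_i = \beta_i^2 \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} \Big) > 0.$$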

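On the other hand, if $\hat\beta$ is screening consistent, i.e., $|\hat\beta_i| > |\hat\beta_k|$ and $\hat\beta_i \beta_i > 0$, we can construct
$S = I \cup \{i\}$ for any fixed $i, k, I$. Without loss of generality, we assume $\Phi_{ki} > 0$. If we select
$\beta$ such that $\beta_i > 0$, then the strong screening consistency implies $\hat\beta_i > \hat\beta_k$ and $\hat\beta_i > -\hat\beta_k$.
From $\hat\beta_i > \hat\beta_k$ we have

$$\Phi_{ii}\beta_i + \sum_{j \in I} \Phi_{ij}\beta_j > \sum_{j \in I} \Phi_{kj}\beta_j + \Phi_{ki}\beta_i.$$

By rearranging terms and selecting $\beta \in \mathcal{B}(s, \rho)$ as $\beta_i = 1$, $\beta_j = -\rho \cdot \mathrm{sign}(\Phi_{ij} - \Phi_{kj})$, $j \in I$,
we have

$$\Phi_{ii} > -\sum_{j \in I} (\Phi_{ij} - \Phi_{kj})\beta_j + \Phi_{ki} = \rho \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| + |\Phi_{ki}|.$$

Following the same argument on $\hat\beta_i > -\hat\beta_k$ with a choice of $\beta_i = 1$, $\beta_j = -\rho \cdot \mathrm{sign}(\Phi_{ij} +
\Phi_{kj})$, $j \in I$, we have

$$\Phi_{ii} > \rho \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}| + |\Phi_{ki}|.$$

This concludes the proof. □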

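Proof of Theorem 2. The proof follows almost the same as the sufficiency part
of Theorem 1. Notice that now the definition of $\hat\beta$ becomes

$$\hat\beta = X^T(XX^T)^{-1}X\beta + X^T(XX^T)^{-1}\epsilon.$$

If the condition holds, i.e., for any $i \in S$, $I = S \setminus \{i\}$ and $k \notin S$, we have

$$\Phi_{ii} > \rho \max\Big\{ \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}|, \ \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \Big\} + |\Phi_{ik}| + 2\tau^{-1} \|X^T(XX^T)^{-1}\epsilon\|_\infty.$$

Defining $\eta = X^T(XX^T)^{-1}\epsilon$, we have for any $i \in S$,

$$|\hat\beta_i| = \Big| \Phi_{ii}\beta_i + \sum_{j \in I} \Phi_{ij}\beta_j + \eta_i \Big| \ge |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} + \beta_i^{-1}\eta_i \Big)$$

$$= |\beta_i| \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} (\Phi_{ij} + \Phi_{kj}) + \Phi_{ki} + \beta_i^{-1}(\eta_i + \eta_k) - \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} - \Phi_{ki} - \beta_i^{-1}\eta_k \Big)$$

$$> -|\beta_i| \Big( \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{kj} + \Phi_{ki} + \beta_i^{-1}\eta_k \Big) = -\frac{|\beta_i|}{\beta_i} \Big( \sum_{j \in I} \beta_j \Phi_{kj} + \beta_i \Phi_{ki} + \eta_k \Big) = -\mathrm{sign}(\beta_i) \cdot \hat\beta_k.$$

Similarly, we can prove $|\hat\beta_i| > \mathrm{sign}(\beta_i) \cdot \hat\beta_k$, and thus $|\hat\beta_i| > |\hat\beta_k|$, which implies that

$$\min_{i \in S} |\hat\beta_i| > \max_{k \notin S} |\hat\beta_k|.$$

The weak sign consistency is established since

$$\hat\beta_i \beta_i = \Phi_{ii}\beta_i^2 + \sum_{j \in I} \Phi_{ij}\beta_j\beta_i + \eta_i\beta_i = \beta_i^2 \Big( \Phi_{ii} + \sum_{j \in I} \frac{\beta_j}{\beta_i} \Phi_{ij} + \beta_i^{-1}\eta_i \Big) > 0$$

for any $\beta_i \neq 0$.

The tightness of this theorem is given by the case when $\epsilon = 0$, for which the condition
has already been shown to be necessary and sufficient in Theorem 1. □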


Appendix B: Proofs for Section 4 


In this section, we prove results from Section 4 that are not covered in the main article. 


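Proof of Corollary 1. Letting $I \subseteq Q$, $|I| \le s - 1$, we have for any $i \neq k \in Q \setminus I$,

$$\Phi_{ii} - \frac{1}{c} \max\Big\{ \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}|, \ \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \Big\} - |\Phi_{ik}| > 1 - \frac{1}{c} \cdot 2(s - 1) \frac{c}{2s} - \frac{c}{2s} = \frac{2 - c}{2s} > 0.$$

This completes the proof for the first case.

Now for the second case, notice that the sum of an entire row (except the diagonal term)
can be bounded by $\sum_{j \neq i} |\Phi_{ij}| < 2\sum_{k=1}^{p-1} r^k < \frac{2r}{1 - r}$. Therefore, we have

$$\Phi_{ii} - \frac{(1 - r)^2}{4r} \max\Big\{ \sum_{j \in I} |\Phi_{ij} + \Phi_{kj}|, \ \sum_{j \in I} |\Phi_{ij} - \Phi_{kj}| \Big\} - |\Phi_{ik}| > 1 - (1 - r) - r = 0.$$

□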
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Proof of Theorem 0 First, from RDD to IC: Without loss of generality, we assume S = 
{1, 2}. For any k £ Q \ S, we have 


[$fci ®k2]®s!'s si 9n(Ps) 


sign(/3i)($ k i - $ 12 ^ 2 ) + sign(/3 2 )(-$i2$ki + ®k 2 ) 

1-^2 


The r.h.s. becomes \$ki + 1(1 — ^ 12 )/(I — $f 2 ) when sign({3i) = sign(/3 2 ) and |$fci — 

$fc 2 |(l + < f>i 2 )/(l — $ 12 ) when sign(/3i) = —sign(/3 2 ). In either case we have 


[$fci ^ 2 ] $5 s sign{(3 s ) 


< 


max { |$i fe + $ 2 fc|, l^ifc - $ 


2fc 


1 — r 


< 


P 


-1 


1 — r 


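Second, from IC to RDD: Let $I \subseteq Q$, $|I| = 1$ and $i \neq k \in Q \setminus I$. Without loss of generality,
we assume $i = 1$, $k = 2$, and we construct $S = \{1, 2\}$. Now for any $j \in I$, using the same
formula as shown above, we have

$$[\Phi_{j1}\ \Phi_{j2}]\,\Phi_{S,S}^{-1}\,\mathrm{sign}(\beta_S) = \frac{\mathrm{sign}(\beta_1)(\Phi_{j1} - \Phi_{12}\Phi_{j2}) + \mathrm{sign}(\beta_2)(-\Phi_{12}\Phi_{j1} + \Phi_{j2})}{1 - \Phi_{12}^2}.$$

Using the same result on the r.h.s., i.e., in absolute value it becomes $|\Phi_{j1} + \Phi_{j2}|(1 - \Phi_{12})/(1 - \Phi_{12}^2)$ when
$\mathrm{sign}(\beta_1) = \mathrm{sign}(\beta_2)$ and $|\Phi_{j1} - \Phi_{j2}|(1 + \Phi_{12})/(1 - \Phi_{12}^2)$ when $\mathrm{sign}(\beta_1) = -\mathrm{sign}(\beta_2)$, we have
for any $j \in I$ that

$$1 - \theta > \big|[\Phi_{j1}\ \Phi_{j2}]\,\Phi_{S,S}^{-1}\,\mathrm{sign}(\beta_S)\big| \quad\Longrightarrow\quad \max\{|\Phi_{1j} + \Phi_{2j}|,\ |\Phi_{1j} - \Phi_{2j}|\} < (1 - \theta)(1 + r).$$

As a result, we have

$$\sum_{j\in I}\max\{|\Phi_{1j} + \Phi_{2j}|,\ |\Phi_{1j} - \Phi_{2j}|\} < (1 - \theta)(1 + r) \le \frac{(1 - \theta)(1 + r)}{1 - r}\,(1 - |\Phi_{12}|),$$

which implies

$$\Phi_{11} > \frac{1}{1 - \theta}\,\frac{1 - r}{1 + r}\sum_{j\in I}\max\{|\Phi_{1j} + \Phi_{2j}|,\ |\Phi_{1j} - \Phi_{2j}|\} + |\Phi_{12}|.$$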


□ 
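
The two closed forms used in this proof can be checked numerically. The following is a minimal sketch, assuming a unit diagonal ($\Phi_{11} = \Phi_{22} = 1$, as the denominator $1 - \Phi_{12}^2$ suggests); the function name `ic_statistic` and the numeric values are illustrative and not part of the paper.

```python
import numpy as np

def ic_statistic(phi_k1, phi_k2, phi_12, sign_1, sign_2):
    """[Phi_k1, Phi_k2] @ inv(Phi_SS) @ sign(beta_S) for S = {1, 2} with unit diagonal."""
    phi_ss = np.array([[1.0, phi_12], [phi_12, 1.0]])
    signs = np.array([sign_1, sign_2], dtype=float)
    return float(np.array([phi_k1, phi_k2]) @ np.linalg.solve(phi_ss, signs))

# Illustrative values (assumptions, chosen only for the check).
phi_k1, phi_k2, phi_12 = 0.20, -0.05, 0.30

same = ic_statistic(phi_k1, phi_k2, phi_12, 1, 1)        # sign(beta_1) = sign(beta_2)
opposite = ic_statistic(phi_k1, phi_k2, phi_12, 1, -1)   # sign(beta_1) = -sign(beta_2)

# Simplified forms: (1 -/+ Phi_12) / (1 - Phi_12^2) = 1 / (1 +/- Phi_12).
closed_same = (phi_k1 + phi_k2) / (1 + phi_12)
closed_opposite = (phi_k1 - phi_k2) / (1 - phi_12)
```

Both cases reduce to dividing $\Phi_{k1} \pm \Phi_{k2}$ by $1 \pm \Phi_{12}$, which is why a bound $|\Phi_{12}| \le r$ yields the factor $1/(1 - r)$.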


Proof of Theorem 4. We just need to check (00). We prove that the absolute value of the
first coordinate of $C_{S^c,S}C_{S,S}^{-1}\cdot\mathrm{sign}(\beta_S)$ is less than one, and the rest follow by the same
argument. From the condition we know $C = X^TX/n$ is restricted diagonally dominant.
Then equation (j3J) implies that for any $I \subseteq Q$ with $|I| = s$, we have for any $k \notin I$,

p^2\c ki \ < i. 

i&I 

Now for any $S \subseteq Q$ with $|S| = s$, we choose $I = S$ and let $a^T$ be the first row of $C_{S^c,S} = X_{S^c}^TX_S/n$; we have


\a T (XgX s /n) 1 sign(p s )\ < \\a\\ 2 \\sign(Ps)\\ 2 ii 
Because |a,| < 1, we have 

p 2 J2 a l < p 2 (Yl M) 2 < 

2—1 2 — 1 

which implies that 

\a T (XlX s /n)~ l sign(l3s)\ < p^y/sp' 1 = — < 1. 

PP 

□ 


Appendix C: Proofs for Section 6 (SIS) 


Proofs in Section 6 are divided into two parts. In this section, we provide the proofs related 
to SIS, and leave those pertaining to HOLP to the next section. The proof requires the 
following proposition, 


Proposition 1. Assume $X_i \sim \chi^2(1)$, $i = 1, 2, \cdots, n$, where $\chi^2(1)$ is the chi-square distribution
with one degree of freedom. Then for any $t > 0$, we have

$$P\Big(\Big|\frac{\sum_{i=1}^n X_i}{n} - 1\Big| \ge t\Big) \le 2\exp\Big\{-\min\Big(\frac{t^2n}{8e^2K},\ \frac{tn}{2eK}\Big)\Big\},$$

where $K = \|\chi^2(1) - 1\|_{\psi_1}$. Alternatively, for any $C > 0$, there exist some $0 < c_3 < 1 < c_4$
such that

$$P\Big(\frac{\sum_{i=1}^n X_i}{n} < c_3\Big) \le e^{-Cn}, \qquad (8)$$

and

$$P\Big(\frac{\sum_{i=1}^n X_i}{n} > c_4\Big) \le e^{-Cn}.$$


Proof. It is a direct application of Proposition 5.16 in [19]. Notice that in the proof of
Proposition 5.16 we have $C = 2e^2$ and $c = e/2$ for $\chi^2(1) - 1$. □


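Proof of Lemma 1. For the diagonal term we have for any $i \in \{1, 2, \cdots, p\}$,

$$\Phi_{ii} = \frac{\sum_{k=1}^n x_{ik}^2}{n},$$

where $x_{ik}$, $k = 1, 2, \cdots, n$, are $n$ iid standard normal random variables. Using Proposition 1
we have for any $t > 0$,

$$P(|\Phi_{ii} - \Sigma_{ii}| \ge t) \le 2\exp\Big\{-\min\Big(\frac{t^2n}{8e^2K},\ \frac{tn}{2eK}\Big)\Big\}. \qquad (9)$$

For the off-diagonal term, we have for any $i \neq j$,

$$\Phi_{ij} = \frac{\sum_{k=1}^n x_{ik}x_{jk}}{n} = \frac{\sum_{k=1}^n (x_{ik} + x_{jk})^2}{2n} - \frac{\sum_{k=1}^n x_{ik}^2}{2n} - \frac{\sum_{k=1}^n x_{jk}^2}{2n}$$

$$= (1 + \Sigma_{ij})\Big(\frac{\sum_{k=1}^n (x_{ik} + x_{jk})^2}{n(2 + 2\Sigma_{ij})} - 1\Big) - \frac{1}{2}\Big(\frac{\sum_{k=1}^n x_{ik}^2}{n} - 1\Big) - \frac{1}{2}\Big(\frac{\sum_{k=1}^n x_{jk}^2}{n} - 1\Big) + \Sigma_{ij}.$$

Notice that $x_{ik} + x_{jk} \sim N(0, 2 + 2\Sigma_{ij})$. Hence the three terms in the above equation can be
bounded using the same inequality as before, i.e., for any $t > 0$,

$$P\big(|\Phi_{ij} - \Sigma_{ij}| \ge (2 + \Sigma_{ij})t\big) \le 6\exp\Big\{-\min\Big(\frac{t^2n}{8e^2K},\ \frac{tn}{2eK}\Big)\Big\}.$$

Clearly, we have $\Sigma_{ij} \le \sqrt{\Sigma_{ii}\Sigma_{jj}} \le 1$. Therefore, we have

$$P(|\Phi_{ij} - \Sigma_{ij}| \ge t) \le 6\exp\Big\{-\min\Big(\frac{t^2n}{72e^2K},\ \frac{tn}{6eK}\Big)\Big\}.$$

□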


Proof of Lemma 2. The proof is essentially the same as that for the off-diagonal terms
of $\Phi$ in Lemma 1. The only difference is that $E(\Phi_{ij}) = \Sigma_{ij}$ while $E(X^T\epsilon) = 0$. Note

$$\eta_i/\sigma = \frac{\sum_{k=1}^n x_{ik}\epsilon_k}{n\sigma} = \frac{\sum_{k=1}^n (x_{ik} + \epsilon_k/\sigma)^2}{2n} - \frac{\sum_{k=1}^n x_{ik}^2}{2n} - \frac{\sum_{k=1}^n \epsilon_k^2/\sigma^2}{2n}.$$

Using Proposition 1 we have

$$P(|\eta_i/\sigma| \ge t) \le 6\exp\Big\{-\min\Big(\frac{t^2n}{72e^2K},\ \frac{tn}{6eK}\Big)\Big\}.$$

□
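
The polarization step behind both lemma proofs (rewriting a cross product as a difference of sums of squares, each of which is a scaled chi-square sum) can be illustrated with a short numerical check. This is a sketch on simulated data; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x_i = rng.standard_normal(n)   # one column of the design
x_j = rng.standard_normal(n)   # another column (or the scaled noise vector)

# Direct cross-product average, as in an off-diagonal entry of X^T X / n.
phi_ij = x_i @ x_j / n

# Polarization: x*y = ((x + y)^2 - x^2 - y^2) / 2, summed over the sample.
polarized = (((x_i + x_j) ** 2).sum() / (2 * n)
             - (x_i ** 2).sum() / (2 * n)
             - (x_j ** 2).sum() / (2 * n))
```

Each of the three sums on the right is a sum of squared Gaussians, so a chi-square concentration bound applies to each one separately, and a union bound over the three terms explains the factor 6 in front of the exponential.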


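Now we turn to the proof of Theorem 5.

Proof of Theorem 5. Taking union bounds on the results from Lemmas 1 and 2, we have
for any $t > 0$,

$$P\Big(\min_{i\in Q}\Phi_{ii} < 1 - t\Big) \le 2p\exp\Big\{-\min\Big(\frac{t^2n}{8e^2K},\ \frac{tn}{2eK}\Big)\Big\},$$

$$P\Big(\max_{i\neq j}|\Phi_{ij}| > r + t\Big) \le 6(p^2 - p)\exp\Big\{-\min\Big(\frac{t^2n}{72e^2K},\ \frac{tn}{6eK}\Big)\Big\},$$

and

$$P\Big(\max_{i\in Q}|\eta_i| > \sigma t\Big) \le 6p\exp\Big\{-\min\Big(\frac{t^2n}{72e^2K},\ \frac{tn}{6eK}\Big)\Big\}.$$

Thus, when $p > 2$ we have

$$P\Big(\min_{i\in Q}\Phi_{ii} < 1 - t \ \text{ or } \ \max_{i\neq j}|\Phi_{ij}| > r + t \ \text{ or } \ \max_{i\in Q}|\eta_i| > \sigma t\Big) \le 7p^2\exp\Big\{-\min\Big(\frac{t^2n}{72e^2K},\ \frac{tn}{6eK}\Big)\Big\}.$$

In other words, for any $\delta > 0$, when $n \ge K\log(7p^2/\delta)$, with probability at least $1 - \delta$, we
have

$$\min_{i\in Q}\Phi_{ii} \ge 1 - 6\sqrt{2}e\sqrt{\frac{K\log(7p^2/\delta)}{n}},$$

$$\max_{i\neq j}|\Phi_{ij}| \le r + 6\sqrt{2}e\sqrt{\frac{K\log(7p^2/\delta)}{n}},$$

and

$$\max_{i\in Q}|\eta_i| \le 6\sqrt{2}e\sigma\sqrt{\frac{K\log(7p^2/\delta)}{n}}.$$
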

A sufficient condition for $\Phi$ to be restricted diagonally dominant is that

$$\min_i \Phi_{ii} > 2\rho s\max_{i\neq j}|\Phi_{ij}| + 2\tau^{-1}\max_i|\eta_i|.$$

Plugging in the values and solving the inequality, we have (notice that $7p^2/\delta < 9p^2/\delta^2$) $\Phi$ is
RDD as long as

This completes the proof. □ 
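
The sufficient condition used in this proof is easy to evaluate on simulated data. The sketch below assumes the SIS forms $\Phi = X^TX/n$ and $\eta = X^T\epsilon/n$ (consistent with the diagonal term used in the proof of Lemma 1); the helper name `sis_rdd_margin` and all numeric settings are illustrative assumptions.

```python
import numpy as np

def sis_rdd_margin(X, eps, rho, s, tau):
    """min_i Phi_ii - 2*rho*s*max_{i!=j}|Phi_ij| - (2/tau)*max_i|eta_i|.

    A positive value means the sufficient condition from the proof holds.
    """
    n = X.shape[0]
    Phi = X.T @ X / n                        # assumed SIS screening matrix
    eta = X.T @ eps / n                      # assumed SIS noise term
    off_diag = Phi - np.diag(np.diag(Phi))   # zero out the diagonal
    return (np.min(np.diag(Phi))
            - 2 * rho * s * np.max(np.abs(off_diag))
            - 2 * np.max(np.abs(eta)) / tau)

# Illustrative setting: independent columns (r = 0), many samples, small noise.
rng = np.random.default_rng(0)
n, p = 5000, 10
X = rng.standard_normal((n, p))
eps = 0.1 * rng.standard_normal(n)
margin = sis_rdd_margin(X, eps, rho=1.0, s=2, tau=1.0)
```

With a large sample and uncorrelated columns, the off-diagonal entries and the noise term are of order $\sqrt{\log p / n}$, so the margin stays positive; it shrinks as $\rho s$ or $\sigma/\tau$ grows, in line with the sample-size requirement of the theorem.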


Appendix D: Proofs for Section 6 (HOLP) 

In this section we prove Lemma [3l [4] and Theorem [5j Several propositions and lemmas are 
needed for establishing the whole theory. We list all prerequisite results without proofs but 
provide readers references for complete proofs. 

Let $P \in O(p)$ be a $p \times p$ orthogonal matrix from the orthogonal group $O(p)$. Let $H$
denote the first $n$ columns of $P$. Then $H$ is in the Stiefel manifold [22]. In general, the
Stiefel manifold $V_{n,p}$ is the space whose points are $n$-frames in $\mathbb{R}^p$, represented as the set of
$p \times n$ matrices $X$ such that $X^TX = I_n$. Mathematically, we can write

$$V_{n,p} = \{X \in \mathbb{R}^{p\times n} : X^TX = I_n\}.$$

There is a natural measure $(dX)$ called Haar measure on the Stiefel manifold, invariant under
both right orthogonal and left orthogonal transformations. We standardize it to obtain a
probability measure as $[dX] = (dX)/V(n,p)$, where $V(n,p) = 2^n\pi^{np/2}/\Gamma_n(p/2)$.
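
As an illustration of the objects just defined, a point of $V_{n,p}$ can be drawn by orthonormalizing a Gaussian matrix. This is a minimal sketch using the standard QR construction (not something the paper itself spells out); the function name `sample_stiefel` is illustrative.

```python
import numpy as np

def sample_stiefel(p, n, rng):
    """Draw a p x n matrix with orthonormal columns via QR of a Gaussian matrix."""
    Z = rng.standard_normal((p, n))
    Q, R = np.linalg.qr(Z)            # reduced QR: Q is p x n, R is n x n
    return Q * np.sign(np.diag(R))    # flip column signs so that diag(R) > 0

rng = np.random.default_rng(0)
H = sample_stiefel(p=7, n=3, rng=rng)
gram = H.T @ H                        # should equal the n x n identity
```

The sign correction makes the factorization unique, which is the usual way to obtain a draw from the normalized invariant measure rather than a distribution that depends on the QR routine's sign convention.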

Lemma 5. [22, Page 41-44] Suppose that a $p \times n$ random matrix $Z$ has the density function
of the form

$$f_Z(Z) = |\Sigma|^{-n/2}g(Z^T\Sigma^{-1}Z),$$

which is invariant under the right-orthogonal transformation of $Z$, where $\Sigma$ is a $p \times p$ positive
definite matrix. Then its orientation $H_Z = Z(Z^TZ)^{-1/2}$ has the matrix angular central
Gaussian distribution (MACG) with a probability density function

$$\mathrm{MACG}(\Sigma) = |\Sigma|^{-n/2}|H_Z^T\Sigma^{-1}H_Z|^{-p/2}.$$

In particular, if $Z$ is a $p \times n$ matrix whose distribution is invariant under both the left- and
right-orthogonal transformations, then $H_Y$, with $Y = BZ$ for $BB^T = \Sigma$, has the $\mathrm{MACG}(\Sigma)$
distribution.

When $n = 1$, the MACG distribution becomes the angular central Gaussian distribution,
a description of the multivariate Gaussian distribution on the unit sphere [23].

Lemma 6. [22, Page 70, Decomposition of the Stiefel manifold] Let $H$ be a $p \times n$ random
matrix on $V_{n,p}$, and write

$$H = (H_1, H_2),$$

with $H_1$ being a $p \times q$ matrix where $0 < q < n$. Then we can write

$$H_2 = G(H_1)U_1,$$

where $G(H_1)$ is any matrix chosen so that $(H_1\ G(H_1)) \in O(p)$; as $H_2$ runs over $V_{n-q,p}$, $U_1$
runs over $V_{n-q,p-q}$ and the relationship is one to one. The differential form $[dH]$ for the
normalized invariant measure on $V_{n,p}$ is decomposed as the product

$$[dH] = [dH_1][dU_1]$$

of those $[dH_1]$ and $[dU_1]$ on $V_{q,p}$ and $V_{n-q,p-q}$, respectively.

Lemma 7. [Lemma f in 11 If] Let $U$ be uniformly distributed on the Stiefel manifold $V_{n,p}$.
Then for any $C > 0$, there exist $c_1', c_2'$ with $0 < c_1' < 1 < c_2'$, such that

$$P\Big(e_1^TUU^Te_1 < c_1'\frac{n}{p}\Big) < 2e^{-Cn},$$

and

$$P\Big(e_1^TUU^Te_1 > c_2'\frac{n}{p}\Big) < 2e^{-Cn}.$$

Some of our proofs require concentration properties of a random Gaussian matrix and
$\chi^2$ random variables. For a Wigner matrix, we have the following result.

Lemma 8. Assume $Z$ is an $n \times p$ matrix with $p > c_0n$ for some $c_0 > 1$. Each entry of $Z$
follows a Gaussian distribution with mean zero and variance one, and the entries are independent. Then
for any $t > 0$, with probability at least $1 - 2\exp(-t^2/2)$, we have

$$(1 - c_0^{-1} - t/p)^2 \le \lambda_{\min}(ZZ^T/p) \le \lambda_{\max}(ZZ^T/p) \le (1 + c_0^{-1} + t/p)^2.$$

For any $C > 0$, taking $t = \sqrt{2Cn}$, we have with probability at least $1 - 2\exp(-Cn/2)$,

$$\Big(1 - c_0^{-1} - \frac{\sqrt{2C}}{c_0\sqrt{n}}\Big)^2 \le \lambda_{\min}(ZZ^T/p) \le \lambda_{\max}(ZZ^T/p) \le \Big(1 + c_0^{-1} + \frac{\sqrt{2C}}{c_0\sqrt{n}}\Big)^2.$$

Proof. This is essentially Corollary 5.35 in [19]. □

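The condition number of $\Sigma$ is controlled by $\kappa$, which simultaneously controls the largest
and the smallest eigenvalues.

Proposition 2. Assume the condition number of $\Sigma$ is $\kappa$ and $\Sigma_{ii} = 1$ for $i = 1, 2, \cdots, p$;
then we have

$$\lambda_{\min}(\Sigma) \ge \kappa^{-1} \quad\text{and}\quad \lambda_{\max}(\Sigma) \le \kappa.$$

Proof. Notice that $p = \mathrm{tr}(\Sigma) = \sum_{i=1}^p \lambda_i$. Therefore, we have

$$p/\lambda_{\max}(\Sigma) \ge p\kappa^{-1} \quad\text{and}\quad p/\lambda_{\min}(\Sigma) \le p\kappa,$$

which completes the proof. □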
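
A quick numerical illustration of Proposition 2 follows. It is a sketch that assumes an AR(1)-type correlation matrix $\Sigma_{ij} = r^{|i-j|}$ as the example; the dimension and the value of `r` are arbitrary.

```python
import numpy as np

# Example correlation matrix with unit diagonal (an assumed test case).
p, r = 6, 0.5
Sigma = np.array([[r ** abs(i - j) for j in range(p)] for i in range(p)])

eigvals = np.linalg.eigvalsh(Sigma)          # ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]
kappa = lam_max / lam_min                    # condition number of Sigma

# tr(Sigma) = p forces lam_min <= 1 <= lam_max, hence the two bounds below.
lower_ok = lam_min >= 1.0 / kappa
upper_ok = lam_max <= kappa
```

Because the eigenvalues average to one, the smallest cannot exceed one and the largest cannot fall below one, which is exactly what turns the ratio $\kappa$ into separate bounds on each extreme eigenvalue.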


Now we prove the main results for HOLP. 

Proof of Lemma 3. Consider a transformed $n \times p$ random matrix $Z = X\Sigma^{-1/2}$, which,
by definition, follows the standard multivariate Gaussian distribution. Consider its singular value decomposition,

$$Z = VDU^T,$$

where $V \in O(n)$, $D$ is a diagonal matrix and $U$ is a $p \times n$ random matrix belonging to the
Stiefel manifold $V_{n,p}$. With this notation, we can rewrite the projection matrix as

$$X^T(XX^T)^{-1}X = \Sigma^{1/2}U(U^T\Sigma U)^{-1}U^T\Sigma^{1/2} = HH^T,$$

where $H = \Sigma^{1/2}U(U^T\Sigma U)^{-1/2}$ and $H \in V_{n,p}$. Therefore, the two quantities that we are
interested in are $\Phi_{ii} = e_i^THH^Te_i$ (diagonal term) and $\Phi_{ij} = e_i^THH^Te_j$ (off-diagonal term),
where $e_i$ is the $p$-dimensional unit vector with the $i$th coordinate being one. The proof is
divided into two parts, where in the first part we consider diagonal terms and the second
part takes care of off-diagonal terms.

Part I: First, we consider the diagonal term $e_i^THH^Te_i$. Recall the definition of $H$ and

$$e_i^THH^Te_i = e_i^T\Sigma^{1/2}U(U^T\Sigma U)^{-1}U^T\Sigma^{1/2}e_i.$$

There always exists some orthogonal matrix $Q$ that rotates the vector $\Sigma^{1/2}e_i$ to the direction
of $e_1$, i.e.,

$$\Sigma^{1/2}e_i = \|\Sigma^{1/2}e_i\|Qe_1.$$

Then we have

$$e_i^THH^Te_i = \|\Sigma^{1/2}e_i\|^2e_1^TQ^TU(U^T\Sigma U)^{-1}U^TQe_1 = \|\Sigma^{1/2}e_i\|^2e_1^T\tilde U(U^T\Sigma U)^{-1}\tilde U^Te_1,$$

where $\tilde U = Q^TU$ is uniformly distributed on $V_{n,p}$, because $U$ is uniformly distributed on $V_{n,p}$
(see discussion in the beginning). Now the magnitude of $e_i^THH^Te_i$ can be evaluated in two
parts. For the norm of the vector $\Sigma^{1/2}e_i$, we have

$$\lambda_{\min}(\Sigma) \le e_i^T\Sigma e_i = \|\Sigma^{1/2}e_i\|^2 \le \lambda_{\max}(\Sigma), \qquad (10)$$

and for the remaining part,

$$e_1^T\tilde U(U^T\Sigma U)^{-1}\tilde U^Te_1 \le \lambda_{\max}\big((U^T\Sigma U)^{-1}\big)\|\tilde U^Te_1\|^2 \le \lambda_{\min}(\Sigma)^{-1}\|\tilde U^Te_1\|^2,$$

and

$$e_1^T\tilde U(U^T\Sigma U)^{-1}\tilde U^Te_1 \ge \lambda_{\min}\big((U^T\Sigma U)^{-1}\big)\|\tilde U^Te_1\|^2 \ge \lambda_{\max}(\Sigma)^{-1}\|\tilde U^Te_1\|^2.$$

Consequently, we have

$$e_i^THH^Te_i \le \frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(\Sigma)}e_1^T\tilde U\tilde U^Te_1, \qquad e_i^THH^Te_i \ge \frac{\lambda_{\min}(\Sigma)}{\lambda_{\max}(\Sigma)}e_1^T\tilde U\tilde U^Te_1. \qquad (11)$$

Therefore, following Proposition 2 together with Lemma 7, for any $C > 0$ we have

$$P\Big(e_i^THH^Te_i < c_1'\kappa^{-1}\frac{n}{p}\Big) < 2e^{-Cn},$$

and

$$P\Big(e_i^THH^Te_i > c_2'\kappa\frac{n}{p}\Big) < 2e^{-Cn}.$$

Denoting 4 C 4 by Ci and c^cj 1 by c 2 , we obtain the equation in Lemma EJ 
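
The rewriting of the projection matrix as $HH^T$ at the start of this proof can be verified numerically. The sketch below uses simulated data; the helper `sym_power`, the dimensions, and the choice of $\Sigma$ are illustrative assumptions.

```python
import numpy as np

def sym_power(M, power):
    """Matrix power of a symmetric positive definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w ** power) @ V.T

rng = np.random.default_rng(1)
n, p = 5, 12
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)          # an arbitrary positive definite covariance
Sigma_half = sym_power(Sigma, 0.5)

Z = rng.standard_normal((n, p))          # standard Gaussian matrix
X = Z @ Sigma_half                       # so that Z = X Sigma^{-1/2}

# Thin SVD Z = V D U^T; numpy returns U^T as the third factor.
_, _, Ut = np.linalg.svd(Z, full_matrices=False)
U = Ut.T                                 # p x n, orthonormal columns

H = Sigma_half @ U @ sym_power(U.T @ Sigma @ U, -0.5)
projection = X.T @ np.linalg.solve(X @ X.T, X)   # X^T (X X^T)^{-1} X
```

Here `H` has orthonormal columns and `H @ H.T` reproduces the projection, so the diagonal and off-diagonal entries of the screening matrix can be studied through a single random element of the Stiefel manifold.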

Part II: Second, for off-diagonal terms, although the proof is almost identical to the
proof of Lemma 5 in [15], we still provide a complete version here due to the importance of
this result.

The proof depends on the decomposition of the Stiefel manifold. Without loss of generality,
we prove the bound only for $e_1^THH^Te_2$; the other off-diagonal terms follow by
exactly the same argument. According to Lemma 6 we can decompose $H = (T_1, H_2)$ with
$T_1 = G(H_2)H_1$, where $H_2$ is a $p \times (n - 1)$ matrix, $H_1$ is a $(p - n + 1) \times 1$ vector and $G(H_2)$ is
a matrix such that $(G(H_2), H_2) \in O(p)$. The invariant measure on the Stiefel manifold can
be decomposed as

$$[H] = [H_1][H_2],$$

where $[H_1]$ and $[H_2]$ are Haar measures on $V_{1,p-n+1}$ and $V_{n-1,p}$ (notice that $q = n - 1$ in this
decomposition), respectively. As pointed out before, $H$ has the $\mathrm{MACG}(\Sigma)$ distribution,
which possesses a density as

$$p(H) \propto |H^T\Sigma^{-1}H|^{-p/2}[dH].$$


Using the identity for matrix determinant

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |A||D - CA^{-1}B| = |D||A - BD^{-1}C|,$$

we have

$$|H^T\Sigma^{-1}H|^{-p/2} = |H_2^T\Sigma^{-1}H_2|^{-p/2}\big(H_1^TG(H_2)^T(\Sigma^{-1} - \Sigma^{-1}H_2(H_2^T\Sigma^{-1}H_2)^{-1}H_2^T\Sigma^{-1})G(H_2)H_1\big)^{-p/2}$$

$$= |H_2^T\Sigma^{-1}H_2|^{-p/2}\big(H_1^TG(H_2)^T\Sigma^{-1/2}(I - T_2)\Sigma^{-1/2}G(H_2)H_1\big)^{-p/2},$$

where $T_2 = \Sigma^{-1/2}H_2(H_2^T\Sigma^{-1}H_2)^{-1}H_2^T\Sigma^{-1/2}$ is an orthogonal projection onto the linear space
spanned by the columns of $\Sigma^{-1/2}H_2$. It is easy to verify the following result by using the
definition of $G(H_2)$,

$$\big[\Sigma^{1/2}G(H_2)(G(H_2)^T\Sigma G(H_2))^{-1/2},\ \Sigma^{-1/2}H_2(H_2^T\Sigma^{-1}H_2)^{-1/2}\big] \in O(p),$$

and therefore we have

$$I - T_2 = \Sigma^{1/2}G(H_2)(G(H_2)^T\Sigma G(H_2))^{-1}G(H_2)^T\Sigma^{1/2},$$

which simplifies the density function as

$$p(H_1, H_2) \propto |H_2^T\Sigma^{-1}H_2|^{-p/2}\big(H_1^T(G(H_2)^T\Sigma G(H_2))^{-1}H_1\big)^{-p/2}.$$

Now it becomes clear that $H_1 | H_2$ follows the angular central Gaussian distribution $\mathrm{ACG}(\Sigma')$,
where

$$\Sigma' = G(H_2)^T\Sigma G(H_2).$$
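
Both algebraic facts used in this step, the block determinant identity and the complementary projection formula for $I - T_2$, can be checked on random inputs. The sketch below is illustrative: the dimensions, seed, and the way $G(H_2)$ is built (as an orthonormal basis of the orthogonal complement of the columns of $H_2$, via a complete QR) are assumptions made for the check.

```python
import numpy as np

def sym_power(M, power):
    """Matrix power of a symmetric positive definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w ** power) @ V.T

rng = np.random.default_rng(2)

# Block determinant identity on a random, well-conditioned 5 x 5 matrix.
M = rng.standard_normal((5, 5)) + 5 * np.eye(5)
A, B, C, D = M[:2, :2], M[:2, 2:], M[2:, :2], M[2:, 2:]
det_full = np.linalg.det(M)
det_via_a = np.linalg.det(A) * np.linalg.det(D - C @ np.linalg.inv(A) @ B)
det_via_d = np.linalg.det(D) * np.linalg.det(A - B @ np.linalg.inv(D) @ C)

# Complementary projections: I - T2 = S^{1/2} G (G^T S G)^{-1} G^T S^{1/2}.
p, n = 8, 3
W = rng.standard_normal((p, p))
Sigma = W @ W.T + p * np.eye(p)
Sigma_half, Sigma_neg_half = sym_power(Sigma, 0.5), sym_power(Sigma, -0.5)

H2, _ = np.linalg.qr(rng.standard_normal((p, n - 1)))   # p x (n-1), orthonormal columns
Q_full, _ = np.linalg.qr(H2, mode="complete")           # p x p orthogonal matrix
G = Q_full[:, n - 1:]                                   # p x (p-n+1), orthogonal to H2

T2 = (Sigma_neg_half @ H2
      @ np.linalg.inv(H2.T @ np.linalg.inv(Sigma) @ H2)
      @ H2.T @ Sigma_neg_half)
complement = Sigma_half @ G @ np.linalg.inv(G.T @ Sigma @ G) @ G.T @ Sigma_half
identity_gap = np.max(np.abs(np.eye(p) - T2 - complement))
```

The second identity holds because the columns of $\Sigma^{1/2}G(H_2)$ and $\Sigma^{-1/2}H_2$ are mutually orthogonal and together span $\mathbb{R}^p$, so the two projections sum to the identity.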

Next, we relate the target quantity $e_1^THH^Te_2$ to the distribution of $H_1$. Notice that for
any orthogonal matrix $Q \in O(n)$, we have

$$e_1^THH^Te_2 = e_1^THQQ^TH^Te_2 = e_1^TH'H'^Te_2.$$

Write $H' = HQ = (T_1', H_2')$, where $T_1' = [T_1'^{(1)}, T_1'^{(2)}, \cdots, T_1'^{(p)}]^T$ and $H_2' = [H_2'^{(i,j)}]$. If we choose
$Q$ such that the first row of $H_2'$ is all zero (this is possible as we can choose the first column
of $Q$ to be the first row of $H$ upon normalizing), i.e.,

$$e_1^TH' = [T_1'^{(1)}, 0, \cdots, 0], \qquad e_2^TH' = [T_1'^{(2)}, H_2'^{(2,1)}, \cdots, H_2'^{(2,n-1)}],$$

then immediately we have $e_1^THH^Te_2 = e_1^TH'H'^Te_2 = T_1'^{(1)}T_1'^{(2)}$. This indicates that

$$e_1^THH^Te_2 \overset{(d)}{=} T_1^{(1)}T_1^{(2)} \ \big|\ e_1^TH_2 = 0.$$


As shown at the beginning, $H_1$ follows $\mathrm{ACG}(\Sigma')$ conditional on $H_2$. Let $H_1 = (h_1, h_2, \cdots, h_{p-n+1})^T$
and let $x^T = (x_1, x_2, \cdots, x_{p-n+1}) \sim N(0, \Sigma')$; then we have

$$h_i \overset{(d)}{=} \frac{x_i}{\sqrt{x_1^2 + \cdots + x_{p-n+1}^2}}.$$

Notice that $T_1 = G(H_2)H_1$, a linear transformation of $H_1$. Defining $y = G(H_2)x$, we have

$$T_1^{(i)} \overset{(d)}{=} \frac{y_i}{\sqrt{y_1^2 + \cdots + y_p^2}}, \qquad (12)$$


where $y \sim N(0, G(H_2)\Sigma'G(H_2)^T)$ follows a degenerate Gaussian distribution. This degenerate
distribution has an interesting form. Letting $z \sim N(0, \Sigma)$, we know $y$ can be expressed
as $y = G(H_2)G(H_2)^Tz$. Write $G(H_2)^T$ as $[g_1, g_2]$ where $g_1$ is a $(p - n + 1) \times 1$ vector and $g_2$ is
a $(p - n + 1) \times (p - 1)$ matrix; then we have

$$G(H_2)G(H_2)^T = \begin{pmatrix} g_1^Tg_1 & g_1^Tg_2 \\ g_2^Tg_1 & g_2^Tg_2 \end{pmatrix}.$$

We can also write $H_2^T = [0_{n-1,1}, h_2]$ where $h_2$ is an $(n - 1) \times (p - 1)$ matrix, and using the
orthogonality, i.e., $[H_2\ G(H_2)][H_2\ G(H_2)]^T = I_p$, we have

$$g_1^Tg_1 = 1, \quad g_1^Tg_2 = 0_{1,p-1} \quad\text{and}\quad g_2^Tg_2 = I_{p-1} - h_2^Th_2.$$

Because the rows of $h_2$ form a set of orthonormal vectors in the $(p - 1)$-dimensional space, $g_2^Tg_2$ is an
orthogonal projection onto the space $\{h_2\}^{\perp}$ and $g_2^Tg_2 = AA^T$, where $A = g_2^T(g_2g_2^T)^{-1/2}$ is a
$(p - 1) \times (p - n)$ orientation matrix on $\{h_2\}^{\perp}$. Together, we have

$$y = \begin{pmatrix} 1 & 0 \\ 0 & AA^T \end{pmatrix} z.$$

This relationship allows us to marginalize $y_1$ out with $y$ following a degenerate Gaussian
distribution.

We now turn to transform the condition $e_1^TH_2 = 0$ into constraints on the distribution of
$T_1^{(1)}$. Letting $t_1^2 = e_1^THH^Te_1$, the condition $e_1^TH_2 = 0$ is equivalent to $T_1^{(1)2} = e_1^THH^Te_1 = t_1^2$, which
implies that

$$e_1^THH^Te_2 \overset{(d)}{=} T_1^{(1)}T_1^{(2)} \ \big|\ T_1^{(1)2} = e_1^THH^Te_1.$$

Because the magnitude of $e_1^THH^Te_1$ has been obtained in Part I, we can now condition on
the value of $e_1^THH^Te_1$ to obtain the bound on $T_1^{(2)}$. From $T_1^{(1)2} = t_1^2$, we obtain that

$$(1 - t_1^2)y_1^2 = t_1^2(y_2^2 + y_3^2 + \cdots + y_p^2). \qquad (13)$$

Notice this constraint is imposed on the norm of $\tilde y = (y_2, y_3, \cdots, y_p)$ and is thus independent
of $(y_2/\|\tilde y\|, \cdots, y_p/\|\tilde y\|)$. Equation (13) also implies that

$$(1 - t_1^2)(y_1^2 + y_2^2 + \cdots + y_p^2) = y_2^2 + y_3^2 + \cdots + y_p^2. \qquad (14)$$

Therefore, combining (12) with (13), (14) and integrating $y_1$ out, we have

$$T_1^{(i)} \ \big|\ T_1^{(1)} = t_1 \ \overset{(d)}{=}\ \frac{\sqrt{1 - t_1^2}\,y_i}{\sqrt{y_2^2 + \cdots + y_p^2}}, \qquad i = 2, 3, \cdots, p,$$

where $(y_2, y_3, \cdots, y_p) \sim N(0, AA^T\Sigma_{22}AA^T)$ with $\Sigma_{22}$ being the covariance matrix of $z_2, \cdots, z_p$.
To bound the numerator, we use the classical tail bound on the normal distribution: since
$\sigma_i = \sqrt{\mathrm{var}(y_i)} \le \sqrt{\lambda_{\max}(AA^T\Sigma_{22}AA^T)} \le \lambda_{\max}(\Sigma)^{1/2}$, for any $t > 0$,

$$P\big(|y_i| > t\lambda_{\max}^{1/2}(\Sigma)\big) \le P(|y_i| > t\sigma_i) \le 2e^{-t^2/2}. \qquad (15)$$
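
The algebra linking (12), (13) and (14) can be confirmed with a few lines of arithmetic. The sketch below draws an arbitrary vector `y` (standing in for the degenerate Gaussian vector in the proof) and checks that, once $t_1^2$ is set to the squared first coordinate of $y/\|y\|$, both constraints hold and the remaining coordinates take the stated form; the dimension and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(7)                 # stand-in for (y_1, ..., y_p)

T = y / np.linalg.norm(y)                  # representation (12)
t1_sq = T[0] ** 2                          # t_1^2 = (T^{(1)})^2
tail_sq = np.sum(y[1:] ** 2)               # y_2^2 + ... + y_p^2

lhs13, rhs13 = (1 - t1_sq) * y[0] ** 2, t1_sq * tail_sq        # equation (13)
lhs14, rhs14 = (1 - t1_sq) * np.sum(y ** 2), tail_sq           # equation (14)

# Remaining coordinates written without y_1, as in the conditional representation.
conditional_form = np.sqrt(1 - t1_sq) * y[1:] / np.sqrt(tail_sq)
```

The point of the rewriting is that `conditional_form` depends on `y[1:]` only through its direction, scaled by $\sqrt{1 - t_1^2}$, which is what allows $y_1$ to be integrated out.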


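For the denominator, letting $z \sim N(0, I_{p-1})$ and writing $\tilde y = (y_2, \cdots, y_p)^T$, we have

$$\tilde y = AA^T\Sigma_{22}^{1/2}z \quad\text{and}\quad \tilde y^T\tilde y = z^T\Sigma_{22}^{1/2}AA^T\Sigma_{22}^{1/2}z \overset{(d)}{=} \sum_{i=1}^{p-n}\lambda_i\chi_i^2(1),$$

where $\chi_i^2(1)$ are iid chi-square random variables and $\lambda_i$ are the non-zero eigenvalues of the matrix
$\Sigma_{22}^{1/2}AA^T\Sigma_{22}^{1/2}$. Here the $\lambda_i$'s are naturally upper bounded by $\lambda_{\max}(\Sigma)$. To give a lower bound,
notice that $\Sigma_{22}^{1/2}AA^T\Sigma_{22}^{1/2}$ and $A^T\Sigma_{22}A$ possess the same set of non-zero eigenvalues, thus

$$\min_i \lambda_i \ge \lambda_{\min}(A^T\Sigma_{22}A) \ge \lambda_{\min}(\Sigma).$$

Therefore,

$$\lambda_{\min}(\Sigma)\frac{\sum_{i=1}^{p-n}\chi_i^2(1)}{p - n} \le \frac{\tilde y^T\tilde y}{p - n} \le \lambda_{\max}(\Sigma)\frac{\sum_{i=1}^{p-n}\chi_i^2(1)}{p - n}.$$

The quantity $\sum_{i=1}^{p-n}\chi_i^2(1)/(p - n)$ can be bounded by Proposition 1. Combining with Proposition 2,
we have for any $C > 0$, there exists some $c_3 > 0$ such that

$$P\big(\tilde y^T\tilde y/(p - n) < c_3\lambda_{\min}(\Sigma)\big) \le e^{-C(p-n)}.$$

Therefore, noticing that $\lambda_{\max}^{1/2}(\Sigma)/\lambda_{\min}^{1/2}(\Sigma) = \kappa^{1/2}$, $T_1^{(2)}$ can be bounded as

$$P\Big(|T_1^{(2)}| > \frac{\sqrt{1 - t_1^2}\,\kappa^{1/2}t}{\sqrt{c_3}\sqrt{p - n}} \ \Big|\ T_1^{(1)} = t_1\Big) \le e^{-C(p-n)} + 2e^{-t^2/2}.$$

Using the results from the diagonal term, we have

$$P\Big(t_1^2 > c_2\kappa\frac{n}{p}\Big) < 2e^{-Cn} \quad\text{and}\quad P\Big(t_1^2 < c_1\kappa^{-1}\frac{n}{p}\Big) < 2e^{-Cn}.$$

Consequently, we have

$$P\Big(|e_1^THH^Te_2| > c_4\kappa t\frac{\sqrt{n}}{p}\Big) = P\Big(|T_1^{(1)}T_1^{(2)}| > c_4\kappa t\frac{\sqrt{n}}{p} \ \Big|\ T_1^{(1)} = t_1\Big)$$

$$\le P\Big(t_1^2 > c_2\kappa\frac{n}{p} \ \Big|\ T_1^{(1)} = t_1\Big) + P\Big(|T_1^{(2)}| > \frac{\kappa^{1/2}t}{\sqrt{c_3}\sqrt{p - n}} \ \Big|\ T_1^{(1)} = t_1\Big)$$

$$\le 5e^{-Cn} + 2e^{-t^2/2},$$

where $c_4 = \sqrt{\dfrac{c_2c_0}{c_3(c_0 - 1)}}$. □
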

Proof of Lemma 4. Notice that conditioning on $X$, for any fixed index $i$, $e_i^TX^T(XX^T)^{-1}\epsilon$
follows a normal distribution with mean zero and variance $\sigma^2\|e_i^TX^T(XX^T)^{-1}\|_2^2$. We can
first bound the variance and then apply the normal tail bound (15) again to obtain an upper
bound for the error term.

The variance term follows

$$\sigma^2e_i^TX^T(XX^T)^{-2}Xe_i \le \sigma^2\lambda_{\max}\big((XX^T)^{-1}\big)e_i^THH^Te_i.$$

The $e_i^THH^Te_i$ part can be bounded according to Lemma 3, while the first part follows

$$\lambda_{\max}\big((XX^T)^{-1}\big) = \lambda_{\max}\big((Z\Sigma Z^T)^{-1}\big) \le \lambda_{\min}^{-1}(ZZ^T)\lambda_{\min}^{-1}(\Sigma) = p^{-1}\lambda_{\min}^{-1}(p^{-1}ZZ^T)\lambda_{\min}^{-1}(\Sigma).$$

Thus, using Lemma 8 and Lemma 3 we have

$$\sigma^2\|e_i^TX^T(XX^T)^{-1}\|_2^2 \le \frac{4\sigma^2c_2\kappa^2}{(1 - c_0^{-1})^2}\frac{n}{p^2}, \qquad (16)$$

with probability at least $1 - 4\exp(-Cn)$ if $n > 8C/(c_0 - 1)^2$. Now combining (15) and (16)
we have for any $t > 0$,

$$P\Big(|e_i^TX^T(XX^T)^{-1}\epsilon| > \frac{2\sigma\sqrt{c_2}\kappa t}{1 - c_0^{-1}}\frac{\sqrt{n}}{p}\Big) < 4e^{-Cn} + 2e^{-t^2/2}.$$

□
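
The quantities appearing in this proof can be computed directly on simulated data. The sketch below assumes the HOLP estimator has the form $\hat\beta = X^T(XX^T)^{-1}Y$ with $Y = X\beta + \epsilon$, so that $\hat\beta = \Phi\beta + \eta$ with $\Phi = X^T(XX^T)^{-1}X$ and $\eta = X^T(XX^T)^{-1}\epsilon$; all dimensions and coefficient values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 60
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                 # sparse signal (illustrative)
eps = 0.5 * rng.standard_normal(n)
Y = X @ beta + eps

gram_inv = np.linalg.inv(X @ X.T)           # (X X^T)^{-1}, n x n
beta_hat = X.T @ gram_inv @ Y               # assumed HOLP estimator
Phi = X.T @ gram_inv @ X                    # screening matrix (a projection)
eta = X.T @ gram_inv @ eps                  # noise part of the estimator

# Variance factor e_i^T X^T (X X^T)^{-2} X e_i for every i: squared column norms.
scaled = gram_inv @ X
var_factor = np.einsum("ij,ij->j", scaled, scaled)

# Bound from the proof: lambda_max((X X^T)^{-1}) * e_i^T H H^T e_i, where H H^T = Phi.
bound = np.linalg.eigvalsh(gram_inv)[-1] * np.diag(Phi)
```

Conditional on `X`, each entry of `eta` is Gaussian with variance $\sigma^2$ times `var_factor`, so controlling `bound` is what lets the normal tail bound be applied coordinate by coordinate.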


Proof of Theorem 6. The proof depends on Lemmas 3 and 4 and a careful choice of the
value of $t$ in these two lemmas. We first take union bounds of the two lemmas to obtain

$$P\Big(\min_{i\in Q}|\Phi_{ii}| < c_1\kappa^{-1}\frac{n}{p}\Big) < 2pe^{-Cn},$$

$$P\Big(\max_{i\neq j}|\Phi_{ij}| > c_4\kappa t\frac{\sqrt{n}}{p}\Big) < 5(p^2 - p)e^{-Cn} + 2(p^2 - p)e^{-t^2/2},$$

and

$$P\Big(\|X^T(XX^T)^{-1}\epsilon\|_\infty > \frac{2\sigma\sqrt{c_2}\kappa t}{1 - c_0^{-1}}\frac{\sqrt{n}}{p}\Big) < 4pe^{-Cn} + 2pe^{-t^2/2}.$$

Notice that once we have

$$\min_i|\Phi_{ii}| > 2s\rho\max_{i\neq j}|\Phi_{ij}| + 2\tau^{-1}\|X^T(XX^T)^{-1}\epsilon\|_\infty, \qquad (17)$$

then the proof is complete because $\Phi - 2\tau^{-1}\|X^T(XX^T)^{-1}\epsilon\|_\infty I_p$ is already a restricted diagonally
dominant matrix. Let $t = \sqrt{Cn}/\nu$. The above equation then requires

$$c_1\kappa^{-1}\frac{n}{p} - \frac{2c_4\sqrt{C}\kappa s\rho}{\nu}\frac{n}{p} - \frac{2\sigma\sqrt{c_2C}\kappa}{(1 - c_0^{-1})\tau\nu}\frac{n}{p} = \Big(c_1\kappa^{-1} - \frac{2c_4\sqrt{C}\kappa s\rho}{\nu} - \frac{2\sigma\sqrt{c_2C}\kappa}{(1 - c_0^{-1})\tau\nu}\Big)\frac{n}{p} > 0,$$

which implies that

$$\nu > \frac{2c_4\sqrt{C}\kappa^2\rho s}{c_1} + \frac{2\sigma\sqrt{c_2C}\kappa^2}{c_1(1 - c_0^{-1})\tau} = C_1\kappa^2\rho s + C_2\kappa^2\tau^{-1}\sigma > 1, \qquad (18)$$


where C\ = 2c \ , C 2 = - ) ■ Therefore, the probability that dT7l) does not hold is 


P 


(HU) does not hold}^ < (p + 5p 2 ) e - Cn + 2 p 2 e~ Cn/v < (7 + i )p 2 e~ Cnlv \ 


where the second inequality is due to the fact that p > n and u > 1. Now for any 5 > 0, 
(ED holds with probability at least 1 — 5 requires that 


v 


n > — ( log(7 + 1/n) + 2 logp - log 5 ), 
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which is certainly satisfied if (notice that \/8 < 3), 


2z/ 2 3 p 


Now pushing $\nu$ to the limit as shown in (18) gives the precise condition we need, i.e.,

$$n > 2C'\kappa^4(\rho s + \tau^{-1}\sigma)^2\log\frac{3p}{\delta},$$

where C' = max{^ 

L 1 6 ll i ~ c O ) 


□